18 ◾ Bioinformatics
1.5 FASTQ READ QUALITY ASSESSMENT
Quality control for sequencing data begins with the isolation of pure nucleic acid from
the samples of interest. After isolation, the nucleic acid quality is usually assessed
before sequencing. The nucleic acid must be pure and at a concentration sufficient for
the library preparation kit being used. If a NanoDrop spectrophotometer is used, purity
is measured by the ratio of light absorbance at 260 nm to that at 280 nm (260/280). A
ratio of ~1.8 is generally accepted as pure for DNA and a ratio of ~2.0 as pure for RNA.
Absorbance at 230 nm, in contrast, is usually due to other contaminants. For pure
nucleic acid, the 260/230 value is often higher than the respective 260/280 value. RNA
quality is better assessed with the Bioanalyzer, which reports the RNA integrity number
(RIN). The RIN ranges from 1 to 10, where 10 indicates fully intact (non-degraded) RNA.
Most commercially available RNA library preparation kits require RNA with a RIN greater
than 7 (RIN > 7) for proper library construction.
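The rule-of-thumb thresholds above can be turned into a quick screening check. The following is only a minimal sketch: the function name check_purity, the 0.1 tolerance around the target ratio, and the output format are assumptions made for illustration, not part of any kit's specification.

```sh
# Hypothetical helper: screen a sample against the rule-of-thumb values above.
# Usage: check_purity <DNA|RNA> <A260> <A280> [RIN]
check_purity() {
  awk -v kind="$1" -v a260="$2" -v a280="$3" -v rin="$4" 'BEGIN {
    if (a280 <= 0) { print "invalid A280 reading"; exit 1 }
    ratio  = a260 / a280
    target = (kind == "RNA") ? 2.0 : 1.8      # ~2.0 for RNA, ~1.8 for DNA
    diff   = ratio - target
    if (diff < 0) diff = -diff
    status = (diff <= 0.1) ? "OK" : "CHECK"   # 0.1 tolerance is an assumption
    # Most RNA library kits ask for RIN > 7
    if (kind == "RNA" && rin != "" && rin <= 7) status = "CHECK"
    printf "%s 260/280=%.2f %s\n", kind, ratio, status
  }'
}

check_purity DNA 1.85 1.00        # prints: DNA 260/280=1.85 OK
check_purity RNA 2.02 1.00 6.5    # prints: RNA 260/280=2.02 CHECK
```

The second sample has an acceptable 260/280 ratio but is still flagged because its RIN is not greater than 7.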
Another quality control checkpoint comes after library preparation and before
sequencing. The nucleic acid libraries are assessed to ensure that no adaptor primers
remain that could form adaptor dimers among the sequenced DNA fragments. Adaptor dimers
are DNA products made up of complete adaptor sequences; they can bind and cluster on the
flow cell and generate contaminating reads, which negatively impact the quality of the
sequencing data.
Quality assurance during nucleic acid isolation and the removal of adaptor primers after
library preparation help produce high-quality reads. However, errors can also be
generated during sequencing itself. On Illumina instruments, base-calling errors may
arise when synthesis fails to terminate (the polymerase remains attached), forming
leading strands that are longer than normal, or when synthesis termination fails to be
reversed in the washing cycle, forming lagging strands. The base-calling software
converts fluorescence signals into sequence data with quality scores, which are stored
in FASTQ files.
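To make the quality scores concrete, the sketch below decodes a FASTQ quality string into numeric Phred scores. It assumes the Phred+33 encoding used by current Illumina software, in which each character's ASCII code minus 33 gives the score Q, and Q = -10 log10(p) for an estimated base-call error probability p. The function name phred_scores is hypothetical.

```sh
# Hypothetical helper: print the Phred score of every base in a quality string.
# Assumes Phred+33 encoding (ASCII code minus 33).
phred_scores() {
  # od prints each byte as an unsigned decimal; -v stops od from
  # collapsing repeated output lines into "*".
  printf '%s' "$1" | od -An -v -t u1 | awk '
    { for (i = 1; i <= NF; i++) { out = out sep ($i - 33); sep = " " } }
    END { print out }'
}

phred_scores 'I5!'    # prints: 40 20 0
```

Here "I" decodes to Q40 (an estimated error probability of 1 in 10,000), "5" to Q20 (1 in 100), and "!" to Q0.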
A post-sequencing quality check must be carried out to confirm that the reads are of good
quality and that there are no problems or biases that could lead to inaccurate or
misleading results. FastQC [10] is one of the most popular programs for assessing the
quality of sequencing data produced by high-throughput instruments. FastQC provides a
simple way to assess the quality of the raw data and generates summary and simple
graphical reports that give a clear picture of the overall quality of the raw data and
of any potential quality problems. FastQC can be downloaded for all platforms from
“https://www.bioinformatics.babraham.ac.uk/projects/fastqc/”. You can use the following
steps to install the latest FastQC version, as of this writing, on Linux (Ubuntu):
Use the “wget” command to download the current version; the current version is v0.11.9,
which may change in the future.
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
Decompress the zipped file with unzip: